Welcome everybody to the next part of deep learning. Today we want to finish talking
about common practices and in particular we want to have a look at the evaluation.
Machine learning is the science of sloppiness really.
So of course we need to evaluate the performance of the models that we have trained. So far, we have worked with the training set and estimated the hyperparameters and all of the model parameters on it, and now we want to evaluate the generalization performance on previously unseen data.
This means the test data, and it's time to open the vault.
Remember: "Of all things the measure is man" [8].
Humans are a low bar to exceed.
So data is annotated and labeled by humans, and during training all labels are assumed
to be correct, but of course to err is human.
All input is potentially erroneous, which means that in addition we may have biased data.
The ideal situation that you actually want to have for your data is that it has been
annotated by multiple human raters; then you can take the mean or a majority vote.
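As a small illustration, here is a minimal sketch in Python, with entirely hypothetical ratings, of how a majority vote or a mean label could be derived from several raters:

```python
import numpy as np

# Hypothetical example: three raters label five samples (classes 0-2).
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [0, 2, 0],
    [1, 1, 0],
])

# Majority vote per sample (ties resolved by the lowest class index here).
majority = np.array([np.bincount(r, minlength=3).argmax() for r in ratings])

# For ordinal or continuous annotations (e.g. a rating scale), the mean is an option.
mean_label = ratings.mean(axis=1)

print(majority)     # [0 1 2 0 1]
print(mean_label)
```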
There is also a very nice paper by Stefan Steidl from 2005 [8] that introduces an entropy-based measure which takes into account the confusions of human reference labelers.
This is very useful in situations where you have unclear labels; in emotion recognition in particular this is a problem. Humans sometimes confuse classes like angry versus annoyed, while they are not very likely to confuse angry versus happy.
That is a very clear distinction, but of course there are different degrees of happiness: sometimes you are just a little bit happy, and then it becomes really difficult to differentiate happy from neutral, which is also hard for humans.
So for prototypical emotions, for example when they are played by actors, you get emotion recognition rates well over 90 percent, but with real data, with emotions as they occur in daily life, they are much harder to predict.
This can also be seen in the labels and in the distribution of the labels.
If you have a prototype, all of the raters will agree that it is clearly this particular class.
If you have nuanced, less clear emotions, you will see that the raters produce a more or less uniform distribution over the labels, because they cannot reliably assess the specific sample either.
So mistakes by the classifier are obviously less severe if humans confuse the same classes, and this is what the entropy-based measure takes into account.
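The exact formulation of the measure is given in [8]; the following Python sketch only illustrates the underlying idea with hypothetical label distributions, namely that the per-sample entropy of the human labels indicates ambiguity, and that an error on a class which no human chose is the more severe one. It is not the paper's actual formula.

```python
import numpy as np

def label_entropy(dist):
    """Shannon entropy of a per-sample human label distribution."""
    p = dist[dist > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical reference distributions from several human labelers (4 classes):
# a clear prototype vs. an ambiguous sample.
human = np.array([
    [1.0, 0.0, 0.0, 0.0],   # all raters agree -> entropy 0
    [0.4, 0.4, 0.1, 0.1],   # raters disagree  -> high entropy
])
predicted = np.array([1, 1])          # classifier output (class index)
reference = human.argmax(axis=1)      # majority reference label

for h, y_hat, y in zip(human, predicted, reference):
    # An error is considered severe if the predicted class was never chosen by humans.
    severe = (y_hat != y) and (h[y_hat] == 0.0)
    print(f"entropy={label_entropy(h):.2f}, predicted={y_hat}, "
          f"reference={y}, severe_error={severe}")
```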
Now, if we look at performance measures, you want to use the typical classification measures, and they are built around the true positives, false positives, true negatives, and false negatives.
From these, for binary classification problems, you can then compute true and false positive rates.
This typically leads to numbers like the accuracy, which is the number of true positives plus true negatives over the total number of positives and negatives.
Then there is the precision, or positive predictive value, which is computed as the number of true positives over the number of true positives plus false positives.
The so-called recall, or true positive rate, is defined as the true positives over the true positives plus the false negatives.
The specificity, or true negative rate, is given as the true negatives over the true negatives plus the false positives.
Finally, the F1 score combines precision and recall: it is two times precision times recall divided by the sum of precision and recall, i.e. their harmonic mean.
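All of these measures follow directly from the four counts; a minimal sketch with hypothetical labels (division-by-zero guards omitted for brevity):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute the basic counts and the derived measures for a binary problem."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)                      # positive predictive value
    recall      = tp / (tp + fn)                      # sensitivity / true positive rate
    specificity = tn / (tn + fp)                      # true negative rate
    f1          = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
print(binary_metrics(y_true, y_pred))   # all 0.75 for this toy example
```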
I typically recommend receiver operating characteristic (ROC) curves, because all of the measures you have seen above depend on a decision threshold.
With ROC curves, you essentially evaluate your classifier at all possible thresholds, plotting the true positive rate against the false positive rate.
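A minimal sketch of how such a curve can be computed by sweeping the decision threshold over hypothetical classifier scores (assuming no tied scores, and using the trapezoidal rule for the area under the curve):

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep over all decision thresholds and return (FPR, TPR) pairs."""
    order = np.argsort(-scores)                       # sort by descending score
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # true positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false positive rate
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# Hypothetical classifier scores and ground-truth labels.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0  ])

fpr, tpr = roc_curve(scores, labels)
auc = np.trapz(tpr, fpr)   # area under the ROC curve
print(fpr, tpr, auc)
```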
Deep Learning - Common Practices Part 4
This video discusses how to evaluate deep learning approaches.
Video References:
Lex Fridman's Channel
Further Reading:
A gentle Introduction to Deep Learning
References:
[1] M. Aubreville, M. Krappmann, C. Bertram, et al. “A Guided Spatial Transformer Network for Histology Cell Differentiation”. In: ArXiv e-prints (July 2017). arXiv: 1707.08525 [cs.CV].
[2] James Bergstra and Yoshua Bengio. “Random Search for Hyper-parameter Optimization”. In: J. Mach. Learn. Res. 13 (Feb. 2012), pp. 281–305.
[3] Jean Dickinson Gibbons and Subhabrata Chakraborti. “Nonparametric statistical inference”. In: International encyclopedia of statistical science. Springer, 2011, pp. 977–979.
[4] Yoshua Bengio. “Practical recommendations for gradient-based training of deep architectures”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
[6] Boris T Polyak and Anatoli B Juditsky. “Acceleration of stochastic approximation by averaging”. In: SIAM Journal on Control and Optimization 30.4 (1992), pp. 838–855.
[7] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions”. In: CoRR abs/1710.05941 (2017). arXiv: 1710.05941.
[8] Stefan Steidl, Michael Levit, Anton Batliner, et al. “Of All Things the Measure is Man: Automatic Classification of Emotions and Inter-labeler Consistency”. In: Proc. of ICASSP. IEEE - Institute of Electrical and Electronics Engineers, Mar. 2005.